Time-series: theory (2/2)

This notebook presents the foundational models in time-series beyond the random walk: autoregressive and moving-average processes.

1 Getting started (packages)

library(tidyverse)  # THE library for data science
library(openxlsx)   # A cool package to deal with Excel files/formats
library(quantmod)   # Package for financial data extraction
library(tsibble)    # TS with dataframes framework 
library(fable)      # Package for time-series models & predictions
library(feasts)     # Package for time-series analysis
library(forecast)   # Another package for TS (forecasting)
library(tseries)    # Simple package for time-series tests
library(plotly)     # For 3D graphs
library(mvtnorm)    # For multivariate Gaussian distributions
library(crypto2)    # For crypto data
library(ggsci)      # For color palettes
set.seed(42)        # Random seed

2 The auto-regressive model

2.1 Introduction

Definition

Let us introduce a small yet crucial modification to the random walk process and consider: \[x_t=a+\rho x_{t-1}+e_t, \quad a \in \mathbb{R}, \quad |\rho|< 1.\] This process is called the Auto-Regressive model of order one and is written AR(1).

Iterating, we have: \[x_t = a + \rho (a + \rho x_{t-2}+e_{t-1})+e_{t}\] and, until infinity… \[x_t=a \sum_{k=0}^\infty\rho^k+ \sum_{k=0}^\infty \rho^ke_{t-k}=\frac{a}{1-\rho}+ \sum_{k=0}^\infty \rho^ke_{t-k},\] which explains why we need \(|\rho|<1\) (if not, the process explodes asymptotically).

2.2 Interpretation

The process’ current value is determined by past values plus a contemporary shock.

This makes AR(p) processes useful to model variables that evolve gradually, such as:

  • GDP
  • Inflation
  • Interest rates
  • Temperature
  • Asset prices.

In the AR(p) model, because of the memory, shocks can have a lasting effect. The higher the \(\rho\) parameter, the longer the memory.

2.3 Properties

Basic properties

We have \[\mathbb{E}[x_t]=\frac{a}{1-\rho}, \] and when innovations are independent, \[\mathbb{V}[x_t]=\mathbb{V}\left[\sum_{k=0}^\infty \rho^ke_{t-k} \right]=\sum_{k=0}^\infty \mathbb{V}\left[\rho^ke_{t-k} \right]=\frac{\sigma^2}{1-\rho^2}.\]

In this case, we see that the mean and variance do not depend on \(t\)! (obvious given the infinite series).

Moreover,

\[\mathbb{C}ov[x_t,x_{t+h}]=\mathbb{C}ov\left[\sum_{k=0}^\infty \rho^ke_{t-k} ,\sum_{j=0}^\infty \rho^je_{t-j+h} \right]=\sum_{k=0}^\infty \sum_{j=0}^\infty \mathbb{C}ov\left[ \rho^ke_{t-k}, \rho^je_{t-j+h} \right].\] Again, we need to focus on the cases when the time indices are equal, so \[\mathbb{C}ov[x_t,x_{t+h}]=\sum_{k=0}^\infty \sum_{j=h}^\infty \mathbb{C}ov\left[ \rho^ke_{t-k}, \rho^{j-h}e_{t-j} \right]=\sum_{k=0}^\infty \sum_{j=h}^\infty \rho^{k+j-h}\mathbb{C}ov[e_{t-k},e_{t-j} ].\]

Thus, for \(j=k\), we get (mind the change in index) \[\mathbb{C}ov[x_t,x_{t+h}]=\sigma^2 \sum_{j=h}^\infty \rho^{2j-h}=\sigma^2 \rho^h \sum_{j=0}^\infty \rho^{2j}=\sigma^2\frac{\rho^h}{1-\rho^2}.\] Again, this does not depend on the time index \(t\). If innovations are Gaussian, as Gaussian distributions are entirely defined by their first two moments, this means that the full (unconditional) distribution of the process is invariant. This very convenient property is called stationarity, and comes in various flavors (have a look on the net!). It is very suitable for analysis because now we know that the object that we study (the underlying distribution) is stable in time.
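These closed-form moments are easy to check numerically. Below is a minimal sketch with illustrative values \(a=1\), \(\rho=0.5\), \(\sigma=1\), using base R's stats::filter for the recursion:

```r
a   <- 1      # intercept (illustrative value)
rho <- 0.5    # persistence, |rho| < 1
n   <- 1e5    # long sample to approximate the asymptotic moments
set.seed(42)
# Recursive filter: x_t = (a + e_t) + rho * x_{t-1}
x <- stats::filter(a + rnorm(n), filter = rho, method = "recursive")
x <- x[-(1:1000)]  # drop burn-in (the recursion starts at 0)
c(mean = mean(x), theory = a / (1 - rho))    # should be close to 2
c(var = var(x), theory = 1 / (1 - rho^2))    # should be close to 4/3
```

The sample moments match the theoretical values up to simulation noise, as expected for a stationary process.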

The above quantity, which measures the link between different realizations of the same variable (at different points in time) is very important. Often, the covariance is scaled by the variance to yield the auto-correlation of the process.

Note

Relatedly, there is the auto-correlogram, which is defined, for stationary series, as

\[\gamma(h)=\mathbb{C}orr[x_t,x_{t+h}].\]

The sample version is \[\hat{\gamma}(k) = \frac{\sum_{t=1}^{n-k}(x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n}(x_t - \bar{x})^2}.\]

For the AR(1), \(\gamma(h)=\rho^h\), a super simple form: as \(h\) increases, the correlation decays geometrically (at rate \(\rho\)).
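As a quick sanity check, the sample formula above can be computed by hand and compared with base R's acf(), which uses the same mean-corrected, divide-by-\(n\) convention:

```r
set.seed(1)
x  <- rnorm(50)
k  <- 3
xb <- mean(x)
# Sample autocorrelation at lag k, computed from the definition
gamma_hat <- sum((x[1:(length(x) - k)] - xb) * (x[(1 + k):length(x)] - xb)) /
             sum((x - xb)^2)
# acf() stores lag 0 in position 1, so lag k sits in position k + 1
gamma_acf <- acf(x, lag.max = k, plot = FALSE)$acf[k + 1]
all.equal(gamma_hat, gamma_acf)
```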

More generally, autoregressive models of higher orders (AR(p)) have memory that goes back further in time: \[x_t=a+\sum_{k=1}^p\rho_k x_{t-k}+e_t, \quad a \in \mathbb{R},\]

Even more generally (see later on below), there are ARMA(p,q) processes, defined as \[x_t=a+\sum_{k=1}^p\rho_k x_{t-k} +\sum_{j=1}^q\delta_je_{t-j}+e_t, \quad a \in \mathbb{R},\]

but more on these is left to your curiosity.

2.4 Illustration (simulations)

Let’s simulate two such processes. We refer to the ARIMA page for details on the implementation. The order is \((p,d,q)\), as in the ARMA(p,q) process (the \(d\) is for integration, a notion outside the scope of this course, simply linked to the fact that the series can be differenced if need be).

N <- 180
x_02 <- arima.sim(list(order = c(1,0,0), ar=.02), n = N)
x_98 <- arima.sim(list(order = c(1,0,0), ar=.98), n = N)
tibble(time = 1:N, x_02 = x_02, x_98 = x_98) |>
  pivot_longer(-time, names_to = "process", values_to = "values") |>
  ggplot(aes(x = time, y = values, color = process)) + geom_line(linewidth = 1) +
  theme_bw() +
  scale_color_manual(values = c("#FD9911", "#0066CC"))

One of the processes seems to be just white noise, the other evolves more slowly and oscillates much less.

Let’s have a look at the auto-correlogram of the series. In R, ACF means AutoCorrelation Function.

ggAcf(x_98) + theme_bw() + 
  geom_function(fun = function(x) 0.98^x, color = "#11DD33")

The values we extract from the sample are not exactly those predicted by the model. This comes from the size of the sample, which is too small. As it increases, the sample autocorrelations converge to the theoretical ones (you can try!).

3 Moving averages

3.1 Definitions

Simple moving average

The SMA, in its backward-looking version: \[\text{MA}_t(X,N) = \frac{1}{N} \sum_{n=1}^N X_{t+1-n} . \]

But maybe some observations matter more than others.

Exponentially-weighted moving average

The EWMA, backward-looking: \[\text{EWMA}_t(X,N, \alpha) = (1-\alpha) \sum_{n=1}^N \alpha^{n-1}X_{t+1-n}, \quad \alpha \in (0,1). \] This gives more weight to the most recent observations. The scaling factor comes from \(\sum_{n=0}^\infty \alpha^n=(1-\alpha)^{-1}\): for large \(N\), the weights approximately sum to one.

The averages can also be forward-looking (with knowledge of the future): \[\text{MA}^{\text{fwd}}_t(X,N) = \frac{1}{N} \sum_{n=1}^N X_{t-1+n}. \]

Or sometimes they can even be centered (taking \(N/2\) before and \(N/2\) after). This depends on the context.

Updating is very easy for the EWMA (based on the previous value of EWMA):

\[\text{EWMA}_t(X,N, \alpha) = \alpha \text{EWMA}_{t-1}(X,N, \alpha) + (1-\alpha) X_t .\]
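A minimal sketch of this equivalence, with the recursion initialized at the first observation (an arbitrary but common choice):

```r
ewma_recursive <- function(x, alpha) {
  out <- numeric(length(x))
  out[1] <- x[1]  # initialization choice: start at the first observation
  for (t in 2:length(x)) out[t] <- alpha * out[t - 1] + (1 - alpha) * x[t]
  out
}
# Direct weighted sum obtained by unrolling the recursion up to time t
ewma_direct <- function(x, alpha, t) {
  (1 - alpha) * sum(alpha^(0:(t - 2)) * x[t:2]) + alpha^(t - 1) * x[1]
}
x <- c(10, 12, 9, 11, 13)
r <- ewma_recursive(x, alpha = 0.8)
d <- sapply(2:5, function(t) ewma_direct(x, 0.8, t))
max(abs(r[2:5] - d))  # numerically zero: the two forms coincide
```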

3.2 Examples

t <- 1:90
d <- t + 10*sin(t*10) + 10*rnorm(length(t))
MA_05 <- SMA(d, 5)
EWMA_10 <- EMA(d, 10, ratio = 0.5)
MA_20 <- SMA(d, 20)
data.frame(t, d, MA_05, EWMA_10, MA_20) |>
  pivot_longer(-t, values_to = "value", names_to = "series") |>
  ggplot(aes(x = t, y = value, color = series)) + geom_line(linewidth = 0.9) +
  theme_classic() + 
  theme(legend.position = c(0.2,0.8)) +
  scale_color_d3()

Note that when \(N\) is large, the averaging takes some time (i.e., points) and the MA starts later in the sample…

4 The MA(q) process

4.1 Introduction

Definition

A Moving Average process of order \(q\), denoted MA(\(q\)), models a univariate time series \(\{x_t\}\) as a linear function of the current and past error terms:

\[ x_t = \mu + e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2} + \cdots + \theta_q e_{t-q}, \]

where:

  • \(\mu\) is the mean of the process,
  • \(e_t\) is a white-noise innovation with variance \(\sigma^2\),
  • \(\theta_1, \theta_2, \dots, \theta_q\) are the MA coefficients.

The innovations are assumed to satisfy: \[e_t \sim (0, \sigma^2), \qquad \text{i.i.d.},\]

with: \[ \mathbb{E}[e_t e_{t-s}] = \begin{cases} \sigma^2, & s = 0, \\ 0, & s \neq 0. \end{cases} \]

For the autocovariance (taking \(\mu = 0\) for simplicity and setting \(\theta_0=1\)):

\[\mathbb{C}ov\left[ x_t,x_{t+h} \right] = \mathbb{E}\left[ \sum_{s=0}^q \theta_s e_{t-s} \sum_{r=0}^q \theta_r e_{t+h-r} \right]=\sum_{s=0}^q\sum_{r=0}^q \theta_s \theta_r\, \mathbb{E}\left[ e_{t-s} e_{t+h-r}\right].\]

Hence, nonzero terms are those for which \(t-s=t+h-r\), i.e., \(r = s+h\).

In short, the autocovariance function of an MA(\(q\)) process is nonzero only up to lag \(q\):

\[ \gamma(s) = \begin{cases} \sigma^2 \left( 1 + \theta_1^2 + \cdots + \theta_q^2 \right), & s = 0, \\[6pt] \sigma^2 \left( \theta_s + \theta_1 \theta_{s+1} + \cdots + \theta_{q-s} \theta_q \right), & 1 \le s \le q, \\[6pt] 0, & s > q. \end{cases} \]

Thus, the MA(\(q\)) process has a finite autocorrelation structure: beyond lag \(q\), there is no autocorrelation left.
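This cut-off is easy to verify with stats::ARMAacf, which computes theoretical autocorrelations; here for an MA(2) with illustrative coefficients \(\theta = (0.6, -0.4)\):

```r
theta <- c(0.6, -0.4)              # illustrative MA(2) coefficients
denom <- 1 + sum(theta^2)          # gamma(0) / sigma^2
rho1  <- (theta[1] + theta[1] * theta[2]) / denom  # gamma(1) / gamma(0)
rho2  <- theta[2] / denom                          # gamma(2) / gamma(0)
rho   <- ARMAacf(ma = theta, lag.max = 4)          # theoretical ACF, lags 0 to 4
rho  # lags 1 and 2 match the hand-computed values; lags 3 and 4 are exactly zero
```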

4.2 Interpretation

The series is driven entirely by shocks. In an MA(q) process, the value of the time series at time \(t\) (\(x_t\)) is a weighted sum of current and past shocks. These shocks have temporary (not persistent) effects; after some time, they are “forgotten”. Hence, MA models have short memory.

They are useful when:

  • The series reacts sharply to random events (especially true in finance);
  • The effect dissipates quickly;
  • The autocorrelation function drops to zero after a few lags.

4.3 Simulation

We can either build the series manually (interesting intellectually)…

simulate_ma <- function(n, theta, sigma = 1) {
  q <- length(theta)
  epsilon <- rnorm(n + q, mean = 0, sd = sigma)  # innovations, with q extra draws to initialize
  y <- numeric(n)
  for (t in 1:n) {
    # y_t = e_t + theta_1 * e_{t-1} + ... + theta_q * e_{t-q}
    y[t] <- epsilon[t + q] + sum(theta * epsilon[(t + q - 1):t])
  }
  return(y)
}
ma_series_1 <- simulate_ma(n = N, theta = c(0.6, -0.4), sigma = 1)

or rely on the built-in R function…

ma_series_2 <- arima.sim(model = list(ma = c(0.6, -0.4)), n = N)
data.frame(t = 1:N, x_1 = ma_series_1, x_2 = ma_series_2) |>
  pivot_longer(-t, names_to = "series", values_to = "value") |> 
  ggplot(aes(x = t, y = value, color = series)) + geom_line() +
  theme_classic() + 
  theme(legend.position = "top")

The two series do look qualitatively similar (they are different random draws, so they cannot match exactly).

5 The ARMA(p,q) process

Basically, it’s a combination of AR(p) and MA(q)!

An Autoregressive Moving Average process of order \((p, q)\), denoted ARMA(\(p, q\)), combines both autoregressive and moving average components. It is defined as:

\[\begin{align} y_t &= c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2} + \cdots + \theta_q e_{t-q} \\ &= c + \sum_{i=1}^{p} \phi_i\, y_{t-i} + e_t + \sum_{j=1}^{q} \theta_j\, e_{t-j}. \end{align}\]

where:

  • \(y_t\) is the value of the time series at time \(t\);
  • \(c\) is a constant term;
  • \(\phi_1, \dots, \phi_p\) are autoregressive coefficients;
  • \(\theta_1, \dots, \theta_q\) are moving average coefficients;
  • \(e_t\) is a white-noise innovation with variance \(\sigma^2\).

The innovations satisfy \(e_t \sim (0, \sigma^2), \qquad \text{i.i.d.}\), with \[ \mathbb{E}[e_t e_{t-s}] = \begin{cases} \sigma^2, & s = 0, \\ 0, & s \neq 0. \end{cases} \]

Compact lag-operator representation

Using the lag operator \(L y_t = y_{t-1}\), an ARMA(\(p, q\)) process can be written as:

\[\phi(L) y_t = c + \theta(L) e_t,\]

with \(\phi(L) = 1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p,\) and \(\theta(L) = 1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q.\)
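The lag polynomial \(\phi(L)\) also encodes the stationarity condition: the process is stationary when all roots of \(\phi(z)=0\) lie outside the unit circle. A quick check with base R's polyroot, for illustrative AR(2) coefficients:

```r
phi <- c(0.5, 0.3)  # illustrative AR(2) coefficients
# phi(z) = 1 - 0.5 z - 0.3 z^2; polyroot takes coefficients in increasing degree
roots <- polyroot(c(1, -phi))
Mod(roots)              # both moduli exceed 1
all(Mod(roots) > 1)     # TRUE: the AR part is stationary
```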

6 Predictive regressions

6.1 Introduction

Predictive regressions (PRs) are, strictly speaking, somewhat outside the scope of time-series analysis, because they seek to forecast variables with other variables.
The simplest formulation is the following.

Definition

A (simple) predictive regression can be written as \[y_{t+1} = a+ bx_t + e_{t+1},\] where the error terms \(e_t\) satisfy the usual assumptions (iid white noise, etc.).

More general models are written

\[y_{t+k} = a+X_{t}B + e_{t+k},\] where the difference lies in the horizon \(k\) and in the fact that there can be more than one predictor in \(X\).

In inference, several problems can arise when, for instance \(x_t\) is itself auto-correlated. For a modern take on this issue, we refer to Biases in long-horizon predictive regressions.

6.2 In practice

  1. We estimate the model on past data, either with the simple OLS estimator \(\hat{B}=(X'X)^{-1}X'y\) (where each row of \(X\) is lagged by \(k\) periods relative to \(y\)), or with a more robust estimator.
  2. We make a prediction with the current value of \(x_t\): \(\tilde{y}_{t+k}=X_t\hat{B}\).
  3. Evaluate the error (requires the realization of \(y_{t+k}\)).
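The three steps above can be sketched on simulated data (a toy setup, not the crypto data used later: x is a persistent AR(1) predictor and \(y_{t+1} = 0.5\, x_t + e_{t+1}\)):

```r
set.seed(7)
n <- 1000
x <- as.numeric(stats::filter(rnorm(n), 0.8, method = "recursive"))  # persistent predictor
y <- 0.5 * head(x, -1) + rnorm(n - 1)        # y_{t+1} depends on x_t
df    <- data.frame(y = y, x = head(x, -1))  # align y_{t+1} with x_t
train <- df[1:800, ]
test  <- df[801:(n - 1), ]
fit  <- lm(y ~ x, data = train)              # step 1: estimate by OLS
pred <- predict(fit, newdata = test)         # step 2: predict with the current x
rmse <- sqrt(mean((pred - test$y)^2))        # step 3: evaluate the error
c(slope = coef(fit)[["x"]], rmse = rmse)     # slope should be near 0.5, RMSE near 1
```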

7 Application to cryptocurrencies

For simplicity, we will work with daily data. Higher frequencies can be obtained outside the crypto space via the highfrequency package, or by querying dedicated APIs (e.g., Binance or CoinMarketCap).

7.1 Fetching data

We will use the crypto2 package below. If need be: install it.

First, let’s have a look at the list of available coins… which are numerous!

coins <- crypto_list() 

You can have a look at the detailed information for the coins via the code below.

c_info <- crypto_info(coin_list = coins, limit = 30)
❯ Scraping crypto info
❯ Processing crypto info

Next, we can download historical quotes.

coin_names <- c("Bitcoin", "Ethereum", "Tether USDt", "BNB", "Solana")
coin_hist <- crypto_history(coins |> dplyr::filter(name %in% coin_names),
                            start_date = "20170101",
                            end_date = "20251206")
❯ Scraping historical crypto data
❯ Processing historical crypto data

Coin Solana does not have data available! Cont to next coin.
coin_hist <- coin_hist |>  # Timestamps are at midnight, hence we add a day.
  dplyr::mutate(date = 1 + as.Date(as.POSIXct(timestamp, origin="1970-01-01")))

And plot them. NOTE: mind the log-scale for the y-axis.
Also, we use plotly below for a better experience.

library(plotly)
g <- coin_hist |>
  ggplot(aes(x = date, y = close, color = name, label = symbol)) + 
  geom_line() +
  scale_y_log10() + theme_bw() + xlab("")
ggplotly(g)
coin_hist |>
  filter(symbol %in% c("USDT")) |>
  ggplot(aes(x = date, y = close, color = name, label = symbol)) + 
  geom_line() + theme_classic() +
  theme(legend.position = c(0.8,0.8))

Now, the problem with some of these series is that they are NOT stationary. At the beginning of the sample, their values are much lower than those at the end of the sample. This is one of the reasons why, in finance, we look at returns: \[r_t=\frac{p_t-p_{t-1}}{p_{t-1}},\] which are simply relative price changes.
Let’s compute them below. We recall that prices are gathered in the close column.

coin_hist <- coin_hist |> 
  group_by(name) |>
  mutate(return = close / lag(close) - 1 ) |>
  ungroup()

And have a look at their distributions.

coin_hist |>
  ggplot(aes(x = return, fill = name)) + geom_histogram() +
  facet_wrap(vars(name), scales = "free") + theme_bw() +
  theme(axis.title = element_blank(),
        legend.position = c(0.85,0.2))

Are these distributions Gaussian?

There seem to be outliers (far to the left or far to the right of the most common values)!

7.2 Returns

First, let’s have a look at the autocorrelograms (of returns).

crypto_acf <-
  coin_hist |> 
  na.omit() %>%
  split(.$name) %>% 
  map(~acf(.$return, plot = F)) %>% 
  map_dfr(
    ~data.frame(lag = .$lag, 
                acf = .$acf,
                ci = qnorm(0.975)/sqrt(.$n.used)
    ), .id = "name") 
crypto_acf |> 
  filter(lag > 0, lag < 6) |>
  ggplot(aes(x = lag, y = acf, fill = name)) + 
  geom_col(position = "dodge") + theme_bw()

But recall that some of the coins are stable!
Overall, however, there does not seem to be a strong memory pattern.
Hence, daily returns appear to be mostly white noise.
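A more formal way to assess this kind of claim is a Ljung–Box test of joint autocorrelation (base R's Box.test). A minimal sketch on simulated series, contrasting white noise with a persistent AR(1):

```r
set.seed(11)
wn  <- rnorm(500)                                                        # pure white noise
ar1 <- as.numeric(stats::filter(rnorm(500), 0.9, method = "recursive"))  # persistent AR(1)
Box.test(wn,  lag = 5, type = "Ljung-Box")   # high p-value expected: no memory
Box.test(ar1, lag = 5, type = "Ljung-Box")   # tiny p-value: strong memory
```

Applied to the coin returns, the test rarely rejects the no-autocorrelation null, consistent with the white-noise impression above.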

coin_data <- coin_hist |>
  na.omit() |>
  select(timestamp, slug, close) |> 
  distinct() |>  
  group_by(slug) |>
  mutate(return = close / lag(close) - 1) |>
  pivot_wider(names_from = slug, values_from = c(close, return)) 

7.3 Estimations

7.3.1 AR(p)

Ok, so now let’s try to link what we have seen in the theoretical part with the data. This step is called estimation.
Basically, if we think that the data follows \(x_t=bx_{t-1}+e_t\), we need to find a good value for \(b\) (the best possible)!
Let’s have a look at the data again…

coin_data |> ggplot(aes(x = timestamp, y = close_bitcoin)) + geom_line()

So this series is clearly not stationary. Hence we must resort to the differenced series (returns)…

btc_return <- coin_data$return_bitcoin |> na.omit() 
ar(btc_return)

Call:
ar(x = btc_return)

Coefficients:
      1        2  
-0.0275   0.0323  

Order selected 2  sigma^2 estimated as  0.001312

The function automatically determines the optimal number of lags (the order of the process) via the AIC.
We can verify this manually (let’s try with only one lag…).

lm(btc_return ~ lag(btc_return))

Call:
lm(formula = btc_return ~ lag(btc_return))

Coefficients:
    (Intercept)  lag(btc_return)  
       0.002074        -0.028371  

Close! But we only have one lag…

lm(btc_return ~ lag(btc_return) + lag(btc_return, 2))

Call:
lm(formula = btc_return ~ lag(btc_return) + lag(btc_return, 2))

Coefficients:
       (Intercept)     lag(btc_return)  lag(btc_return, 2)  
          0.001975           -0.027935            0.032315  

The second order is ok but we have a small shift in the first order coefficient…

7.4 ARMA(p,q)

Super simple…

arma(btc_return)

Call:
arma(x = btc_return)

Coefficient(s):
      ar1        ma1  intercept  
  -0.2814     0.2479     0.0026  

and easy to interpret, too!

7.5 First step towards predictions & backtest

Time-series is all about forecasting!
And remember: "forecasting is difficult, especially about the future". Following ML parlance, we must distinguish between two key phases:

  • estimation/calibration, training, in-sample models;
  • testing, out-of-sample evaluation.

In ML there are in fact other stages (like validation/tuning, post-training, etc.), but here, let’s keep it simple…

We first split the sample in two.

split_date <- "2022-06-06"
btc_train <- coin_data |> filter(timestamp <= split_date) |> na.omit() |> pull(return_bitcoin)
btc_test <- coin_data |> filter(timestamp > split_date) |> na.omit() |> pull(return_bitcoin)

Then, we train/estimate an ARMA model.
We use the arima() function here, and then make predictions.

fit <- arima(btc_train)
pred <- predict(fit, n.ahead = length(btc_test))
head(pred$pred)
Time Series:
Start = 786 
End = 791 
Frequency = 1 
[1] 0.00255155 0.00255155 0.00255155 0.00255155 0.00255155 0.00255155

Ok Houston, we have a problem… the predictions are constant. But why?

7.6 Predictive regressions

Let’s try to predict ETH returns with BTC returns of the previous period (day)

eth_train <- coin_data |> filter(timestamp <= split_date) |> na.omit() |> pull(return_ethereum)
eth_test <- coin_data |> filter(timestamp > split_date) |> na.omit() |> pull(return_ethereum)
# Predict ETH at t+1 from BTC at t: align the two series explicitly
train <- data.frame(y = eth_train[-1], x = head(btc_train, -1))
test  <- data.frame(y = eth_test[-1],  x = head(btc_test, -1))
fit   <- lm(y ~ x, data = train)
pred  <- predict(fit, newdata = test) # predict.lm expects `newdata` (a data frame); other argument names are silently ignored
head(pred)

In this case, the predictions change from one date to the next because they rely on an exogenous value: the previous BTC return…

How good are these predictions…

The RMSE:

sqrt(mean((pred - test$y)^2))

Not necessarily informative… (more on that next time).

The Hit Ratio = the proportion of time that the prediction is in the right direction:

mean(pred * test$y >= 0)

If the hit ratio is close to 0.5, the model does no better than a coin flip! But at the same time, the model is really simple (we did not put a lot of energy into it).
Moreover, it is stale: it is estimated once on the training sample and never updated.

8 References

8.1 For time-series

There are many resources available online. We single out two below:

8.2 For cryptocurrencies